feat(example): add a kv example based on fjall kv store#1658

Open
ariesdevil wants to merge 1 commit into databendlabs:main from ariesdevil:dev_ariesdevil
Conversation

@ariesdevil
Contributor

@ariesdevil ariesdevil commented Feb 10, 2026

Add a fjall-based KV example. fjall is an LSM key-value store written in pure Rust. Compared to RocksDB, it significantly reduces compile times.

fjall currently lacks remove_range, but I expect it will be implemented soon.

Checklist

  • Updated guide with pertinent info (may not always apply).
  • Squash down commits to one or two logical commits which clearly describe the work you've done.
  • Unit tests are a friend :)

@ariesdevil ariesdevil force-pushed the dev_ariesdevil branch 3 times, most recently from 5bf7453 to 49de246 Compare February 10, 2026 14:20
Member

@drmingdrmer drmingdrmer left a comment

@drmingdrmer reviewed 4 files and all commit messages, and made 1 comment.
Reviewable status: 4 of 18 files reviewed, 1 unresolved discussion (waiting on ariesdevil).


examples/raft-kv-fjall/src/log_store.rs line 185 at r1 (raw file):

        for k in start_index..10_000 {
            self.keyspace_logs().remove(id_to_bin(k)).map_err(|e| io::Error::other(e.to_string()))?;
        }

You must not truncate the log starting from its left boundary: if the server crashes mid-loop, it will leave a hole in the log, which is forbidden. This is documented on the trait method.


@drmingdrmer
Member

Thanks for the contribution! A fjall-based storage backend is a great idea — pure Rust, LSM-tree, faster compile times compared to RocksDB. That addresses a real pain point.

One suggestion on scope: rather than a full KV application example (with HTTP API, network layer, main.rs, and test-cluster.sh), would you consider trimming this down to a storage-only example, similar to examples/rocksstore?

rocksstore contains only:

  • log_store.rs (RaftLogStorage impl)
  • state_machine.rs (RaftStateMachine impl)
  • lib.rs and a unit test wired to the storage test suite

No HTTP server, no CLI binary, no network glue. The storage test suite (Suite::test_all) gives it solid coverage without any of that scaffolding.

The reason: full application examples tend to require more ongoing maintenance (dependency updates, API compatibility, network stack changes), and they duplicate what raft-kv-memstore already demonstrates. A focused storage implementation would be immediately useful to anyone evaluating fjall as a backend, easier to maintain, and a cleaner fit alongside rocksstore.

Happy to help shape it if you want to go that route.

@drmingdrmer
Member

Another consideration worth thinking about: log storage and state machine storage are independent in openraft, and they serve very different roles.

Log storage is on the critical path of consensus — every append, vote, and commit goes through it, so it needs low write latency and sequential append performance. LSM-tree engines like fjall are optimized for write throughput and point lookups, but the compaction overhead and write amplification make them a less natural fit for a Raft log compared to append-oriented storage.

The state machine, on the other hand, is much less latency-sensitive from the consensus perspective. Its only hard requirement for consensus is the ability to produce a snapshot. An LSM-tree engine is actually a good fit here: random reads and writes, key-value lookups, compaction — all align well with typical state machine workloads.

There is already a standalone state machine example at examples/sm-mem that shows this separation in practice.

Have you considered implementing fjall as a RaftStateMachine only, leaving log storage to a more append-friendly backend? That would highlight one of openraft's design strengths (the two traits are fully decoupled), play to fjall's actual strengths, and give users a practical example of mixing storage backends.

@marvin-j97
Contributor

Log storage is on the critical path of consensus — every append, vote, and commit goes through it, so it needs low write latency and sequential append performance. LSM-tree engines like fjall are optimized for write throughput and point lookups, but the compaction overhead and write amplification make them a less natural fit for a Raft log compared to append-oriented storage.

If the writes are strictly sequential, fjall does not actually compact anything, it just uses trivial moves. RocksDB should be doing the same. Even then, as long as the write workload is not way too intensive, compactions don't affect latencies that much (especially when your writes are synchronous). The base write amp (without compaction) is ~2x, which is really not that much.

@ariesdevil
Contributor Author

Happy to help shape it if you want to go that route.

Sure, thx.

@drmingdrmer
Member

drmingdrmer commented Feb 21, 2026

If the writes are strictly sequential, fjall does not actually compact anything, it just uses trivial moves. RocksDB should be doing the same. Even then, as long as the write workload is not way too intensive, compactions don't affect latencies that much (especially when your writes are synchronous). The base write amp (without compaction) is ~2x, which is really not that much.

Fair point on the compaction side — for a strictly sequential append workload, trivial moves do apply and compaction pressure stays low.

But there is a deeper structural issue beyond compaction: fjall has its own WAL (it calls it a "journal"). Looking at the source, it sits in src/journal/ and supports three persist modes: Buffer (OS cache only), SyncData (fdatasync), and SyncAll (fsync). When a write returns from fjall, the data has gone through fjall's journal first, then eventually lands in an SST file.

This creates a durability accounting problem for Raft log storage. Raft requires that once append() signals completion via IOFlushed, the entries are durable. To satisfy that with fjall underneath, you have to fsync through fjall's journal on every append — which means you are paying for two WALs on the write path: fjall's journal and the implicit durability contract of the Raft log itself. You do not get to bypass fjall's journaling to write directly to disk.

For a state machine this does not matter at all, since the state machine can always be rebuilt by replaying the log. Raft never requires the state machine to be durable — only the log must be. That is exactly why fjall looks like a strong fit for RaftStateMachine and a more awkward one for RaftLogStorage.

@marvin-j97
Contributor

To satisfy that with fjall underneath, you have to fsync through fjall's journal on every append

You do not get to bypass fjall's journaling to write directly to disk.

What's the alternative? Writing to a file manually has the same fsync costs. But with an LSM you get automatic spilling to SSTs if the journal gets too large, and the WAL is checksummed by default to prevent recovering corrupted data.
Flushing to SSTs happens in the background so it does not impact writers much.

@drmingdrmer
Member

drmingdrmer commented Feb 21, 2026

To satisfy that with fjall underneath, you have to fsync through fjall's journal on every append
You do not get to bypass fjall's journaling to write directly to disk.

What's the alternative? Writing to a file manually has the same fsync costs. But with an LSM you get automatic spilling to SSTs if the journal gets too large, and the WAL is checksummed by default to prevent recovering corrupted data. Flushing to SSTs happens in the background so it does not impact writers much.

Flushing to SSTs is unnecessary for a Raft log; that is the point.

Either way, there is some extra disk-write burden.

I would still like such an example to encourage applications to use the most appropriate storage, a pure WAL-like store rather than an LSM-based one. But since you say the impact is negligible, it is okay.
